Author
Name Claire Descombes
Affiliation Universitätsklinik für Neurochirurgie, Inselspital Bern
Degree MSc Statistics and Data Science, University of Bern
Contact claire.descombes@insel.ch

The reference material for this course, as well as some useful literature to deepen your knowledge of R, can be found at the bottom of the page.

1 Importing data

1.1 Working directory

When you want to load a file (e.g. a dataset), you have two options:

  • Put the file inside your R working directory (see below), so you don’t need to specify the full path.
  • Store the file somewhere else, and then always give the full (or partial) path when importing it.

If your file is stored in the folder that is currently set as your working directory, you can simply write:

setwd("C:/path/to/your/folder/")
data <- read.csv("testdata.csv")

# This is convenient because you can move the entire folder around without breaking your code: as long as you set the working directory to that folder when you open the script, everything still works.

If you want some structure inside your project (e.g. datasets stored in a subfolder “datasets”), you can use relative paths, which always start from the working directory:

setwd("C:/path/to/your/folder/")
data <- read.csv("datasets/testdata.csv")

# This is safe and portable: moving the whole folder keeps the relative paths valid.

If your files are scattered across your computer, you may prefer to specify the absolute path each time:

data <- read.csv("C:/some/other/folder/testdata.csv")

# This avoids having to change the working directory, but: the code breaks if the file moves, and the script is harder to share with others (everyone has different folder structures).

Working directory

To tell R which folder you are working in (e.g., where your data is stored), you have several options:

  • Go to the tab Session > Set Working Directory > Choose Directory and select your folder manually.
  • Use setwd("C:/path/to/your/folder") in your script/console.
  • Or, the most convenient for script-based work: go to Session > Set Working Directory > To Source File Location to automatically set the working directory to the location of your script.
# Command to display the current working directory
getwd()

# Command to manually set your working directory
setwd("C:/path/to/your/folder")

# Command to automatically set your working directory to the location of your R file
setwd(dirname(rstudioapi::getActiveDocumentContext()$path)) 

1.2 Import CSV and xlsx data

We will first look at how to import a CSV file into R as a data frame.

1.2.1 CSV

CSV stands for Comma-Separated Values. In a .csv file, the values are stored as plain text, separated by commas. This is a simple and widely used format for storing tabular data.

After setting your working directory or determining the path to your CSV file, you can use the read.csv() function to import the data. This will create a data frame, which is one of the most commonly used structures in R for handling datasets.

💡 I recommend to use data frames as data type for your data: they are generally easier to work with than matrices, especially for beginners.

# Import a CSV file into a data frame
dataset <- read.csv("C:/path/to/your/folder/data.csv")

The function read.csv() has several useful arguments:

read.csv(file, header = TRUE, sep = ",", quote = "\"",
         dec = ".", fill = TRUE, comment.char = "", row.names, 
         stringsAsFactors, ...)
  • header: A logical value (TRUE/FALSE) indicating whether the file contains the names of the variables as its first line. If missing, the value is determined from the file format: header is set to TRUE if the first row contains one fewer field than the number of columns.

  • sep: The field separator used in the file. For read.csv(), the default is a comma (,), which is standard for CSV files.

  • row.names: Specifies the row names of the data frame. It can be:

    • a vector of row names,
    • a number indicating the column to use as row names,
    • a character string naming the column with row names.

    If a header is present and the first row has one fewer field than the number of columns, the first column is used as row names. Otherwise, rows are automatically numbered. Use row.names = NULL to force default numbering.

  • col.names: Optional vector of column names. If not provided, default names like “V1”, “V2”, etc., are assigned.

  • stringsAsFactors: TRUE/FALSE; should character vectors be converted to factors?

1.2.2 XLS

Another widely used data format is the Excel file (.xlsx or .xls). For these, you can use the readxl package to import the data:

# Load the readxl package (after installing it)
library(readxl)

# Read the first sheet of an Excel file
dataset <- read_excel("C:/path/to/your/folder/data.xlsx")

The function read_excel() also has several useful arguments:

read_excel(path, sheet = NULL, range = NULL,
  col_names = TRUE, col_types = NULL, na = "",...
)
  • path: Path to the xls/xlsx file.

  • sheet: Sheet to read. Either a string (the name of a sheet), or an integer (the position of the sheet). Ignored if the sheet is specified via range. If neither argument specifies the sheet, defaults to the first sheet.

  • range: A cell range to read from, as described in cell-specification. Includes typical Excel ranges like “B3:D87”, possibly including the sheet name like “Budget!B2:G14”, and more.

  • col_names: TRUE to use the first row as column names, FALSE to get default names, or a character vector giving a name for each column.

  • col_types: Either NULL to guess all from the spreadsheet or a character vector containing one entry per column from these options: “skip”, “guess”, “logical”, “numeric”, “date”, “text” or “list”.

  • na: Character vector of strings to interpret as missing values. By default, readxl treats blank cells as missing data.

⚠️ Note: If your file is actually a CSV but mistakenly has a .xlsx extension, you should rename it to .csv and use read.csv() instead. Mixing up formats can lead to import errors.

1.2.3 Load the NHANES data sets

Let us now look at real data frames to learn how to call or modify their elements. To do this, we will use multiple health data sets from the National Health and Nutrition Examination (NHANES) Survey from 2011-2012. The survey assessed overall health and nutrition of adults and children in the United States and was conducted by the National Center for Health Statistics (NCHS). The data sets can be found in the data_sets folder. More details on these data sets can be found in the Appendix A.

1.3 Exercises

1.3.1 Exercise 1

✏️ Exercise 1: import the demo, bpx, bmx and smq data sets from the data_sets folder into R.

2 Handling data

2.1 Base R vs tidyverse

Base R, without any additional packages, already provides many functions that are very handy for data handling. However, some contributed packages make data preparation much easier and more readable.

I’ll introduce two such packages here, before diving into concrete data handling examples. Both are part of a larger and very powerful collection of packages for data science called the tidyverse, which I use for nearly all my analyses.

💡 In the Appendix B, you will find a table containing useful functions from both Base R and the tidyverse that facilitate efficient data handling.

2.1.1 The pipe

One of the most downloaded contributed extension packages of all times is magrittr. It provides a very useful operator, the forward pipe operator %>%, which passes the object on its left as the first argument to the function on its right. This is much easier to understand with an example.

# The easiest way to get magrittr is to install the whole tidyverse
install.packages("tidyverse")
# Once installed, a package has to be loaded to be used
library(tidyverse)
library(tidyverse)

# Let's do the same operation twice: once using the pipe, once without

# No pipe:
str(c(1,2,3,4))
##  num [1:4] 1 2 3 4
# With pipe:
c(1,2,3,4) %>%
  str()
##  num [1:4] 1 2 3 4
# Not too exciting yet, but consider a more complex case:
summary(log(sqrt(na.omit(c(1, 4, NA, 16, 25)))))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.5199  1.0397  0.9222  1.4421  1.6094
# With the pipe, we can rewrite this more readably:
c(1, 4, NA, 16, 25) %>%
  na.omit() %>%
  sqrt() %>%
  log() %>%
  summary()
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.5199  1.0397  0.9222  1.4421  1.6094

The pipe helps turn nested function calls into a sequence of simpler, linear steps. This makes code easier to read, write, and debug. The pipe becomes especially powerful when used with functions from the dplyr package for data manipulation.

2.1.2 dplyr

Another helpful R package is dplyr. It is a grammar of data manipulation, providing a consistent set of verbs that helps solve the most common data manipulation challenges.

Let’s illustrate this with a simple example. Our goal: Group the cars dataset (contained in base R) by speed groups (e.g. low/medium/high), and for each group, compute (1) the average stopping distance and (2) the number of observations.

# Base R (no dplyr, no pipe)

cars$speed_group <- cut(cars$speed, breaks = c(0, 10, 20, 30), 
                        labels = c("Low", "Medium", "High"))

avg_dist <- aggregate(dist ~ speed_group, data = cars, mean)
n_obs <- aggregate(dist ~ speed_group, data = cars, length)
names(n_obs)[2] <- "n"

summary_df <- merge(avg_dist, n_obs, by = "speed_group")
summary_df
# With dplyr, no pipe:

cars <- mutate(cars, speed_group = cut(speed, breaks = c(0, 10, 20, 30), labels = c("Low", "Medium", "High")))

summary_df <- summarise(group_by(cars, speed_group),
                        avg_dist = mean(dist),
                        n = n())
summary_df
# With dplyr and the pipe
cars %>%
  mutate(speed_group = cut(speed, breaks = c(0, 10, 20, 30), 
                           labels = c("Low", "Medium","High"))) %>%
  group_by(speed_group) %>%
  summarise(
    avg_dist = mean(dist),
    n = n()
  )

💡 cut(x, ...) divides the range of x into intervals (the breaks) and codes the values in x according to which interval they fall. labels are the levels of the resulting category. If labels = FALSE, simple integer codes are returned instead of a factor.

As you can see, using dplyr and the pipe can make your life much easier.

In the following chapter, we’ll use both base R and tidyverse functions without always noting which package they belong to. If you’re ever unsure, you can check the top-left corner of the function’s help page.

2.2 Accessing elements in data frames

Being able to access elements in a data frame is essential when working with data. Here are some common methods to select specific elements, rows, or columns.

# Look at the first respectively last few rows
head(demo)
tail(demo)
# Select columns by name
demo[, c("RIDAGEYR", "RIAGENDR")]  # Selecting age in years and gender
vars <- c("RIDAGEYR", "RIAGENDR")
demo[, vars]  # Alternative using variable `vars`
# Select elements by position
demo[1, 1]  # Access the first element of the first column (the respondent sequence number of the 1st participant)
## [1] 62161
ind_mat <- cbind(c(1, 3, 5), c(2, 4, 6))
demo[ind_mat]  # Access rows and columns using multiple indices
## [1] "NHANES 2011-2012 public release" "Male"                           
## [3] NA
# Select rows based on a condition
head(demo[, "RIDAGEYR"] > 50)  # Logical condition for age greater than 50
## [1] FALSE FALSE FALSE FALSE FALSE FALSE
head(!(demo[, "DMDHHSIZ"] > 3))  # Logical negation for total number of people in the household not greater than 3
## [1] FALSE FALSE FALSE FALSE FALSE FALSE
demo[demo[, "RIDAGEYR"] > 50, ]  # Rows where age > 50
demo[demo[, "DMDHHSIZ"] < 3, ]  # Rows where total number of people in the household greater than 3
demo[demo[, "DMDHHSIZ"] >= 3, ]  # Rows where total number of people in the household greater or equal 3
# Combine logical vectors using "&" (AND), "|" (OR), and "!" (NOT)
demo[(demo[, "RIDAGEYR"] > 50 & demo[, "RIAGENDR"] == "Female"), ]  # Both conditions must be true
demo[(demo[, "DMDHHSIZ"] < 3 | demo[, "RIAGENDR"] == "Male"), ]  # One condition must be true

💡 To inspect one column, you can also use the dollar $ symbol to access a column as a vector.

head(demo$RIDAGEYR)  # Returns the age column as a vector
## [1] 22  3 14 44 14  9

💡 You can use the brackets [] to select specific rows and columns. Since data frames are bi-dimensional, the first index refers to rows and the second to columns. To select a particular column, you can omit the row index. To select a particular row, omit the column index.

head(demo[, "RIDAGEYR"])  # All rows in the age column
## [1] 22  3 14 44 14  9
demo[1, ]  # Row 1 (all columns)

2.2.1 dplyr syntax

With dplyr, you can achieve the same things more readably, especially for filtering rows and selecting columns.

# Select columns
select(demo, RIDAGEYR, RIAGENDR)
# Filter rows based on a condition
filter(demo, RIDAGEYR > 50)
# Combine filtering and selecting
filter(demo, RIDAGEYR > 50 & RIAGENDR == "Female") %>%
  select(RIDAGEYR, RIAGENDR)

💡 Using the pipe (%>%), the structure becomes more linear: first say what data you’re working on, then what to do with it.

demo %>%
  filter(DMDHHSIZ < 3 | RIAGENDR == "Male") %>%
  select(DMDHHSIZ, RIAGENDR)

This syntax avoids nested brackets and makes your code easier to read, especially as operations grow more complex.

2.3 Basic descriptive statistics

R makes it simple to compute basic descriptive statistics for exploring your dataset. Below are a few useful examples.

2.3.1 Central tendency: mean and median

mean(demo$RIDAGEYR, na.rm = TRUE)                 # Average (mean) age of participants
## [1] 31.40262
median(demo$DMDHHSIZ, na.rm = TRUE)               # Median household size
## [1] 4

💡 The na.rm argument in those functions allows for ignoring the NA values.

With dplyr, you can combine summarise() with the pipe for clearer syntax:

demo %>%
  summarise(mean_age = mean(RIDAGEYR, na.rm = TRUE),
            median_household_size = median(DMDHHSIZ, na.rm = TRUE))

2.3.2 Dispersion: standard deviation, min, max, and range

sd(demo$RIDAGEYR, na.rm = TRUE)                     # Standard deviation of age
## [1] 24.57899
range(demo$DMDHHSZA, na.rm = TRUE)                  # Range of number of young children
## [1] 0 3
min(demo$DMDHRAGE, na.rm = TRUE)                    # Minimum age of household reference person
## [1] 18
max(demo$DMDHRAGE, na.rm = TRUE)                    # Maximum age of household reference person
## [1] 80

Also possible with summarise():

demo %>%
  summarise(sd_age = sd(RIDAGEYR, na.rm = TRUE),
            min_age = min(DMDHRAGE, na.rm = TRUE),
            max_age = max(DMDHRAGE, na.rm = TRUE))

2.3.3 Frequency tables and proportions

table(demo$RIAGENDR)                                # Gender
## 
## Female   Male 
##   4900   4856
table(demo$RIDRETH1)                                # Race/Hispanic origin
## 
##                    Mexican American                  Non-Hispanic Black 
##                                1355                                2683 
##                  Non-Hispanic White                      Other Hispanic 
##                                2973                                1076 
## Other Race - Including Multi-Racial 
##                                1669
prop.table(table(demo$DMDCITZN))                    # Citizenship status (proportions)
## 
## Citizen by birth or naturalization            Not a citizen of the US 
##                        0.890677879                        0.106655728 
##                            Refused 
##                        0.002666393

With dplyr, you get frequencies directly:

demo %>%
  count(RIAGENDR)
demo %>%
  count(AIALANGA) %>%
  mutate(prop = n / sum(n))  # Proportional distribution

💡 n() gives the current group size.

2.3.4 Group-wise summaries

aggregate(DMDHRAGE ~ DMDHRMAR, data = demo, FUN = mean, na.rm = TRUE)   # Mean age of household reference person by marital status
aggregate(DMDHHSIZ ~ DMDHRGND, data = demo, FUN = median, na.rm = TRUE) # Median household size by gender of reference person

Or using dplyr:

demo %>%
  group_by(DMDHRMAR) %>%
  summarise(mean_age = mean(DMDHRAGE, na.rm = TRUE))
demo %>%
  group_by(DMDHRGND) %>%
  summarise(median_household_size = median(DMDHHSIZ, na.rm = TRUE))

2.3.5 Full overview

summary(demo)[,1:5] # only for the first 5 variables
##       SEQN         SDDSRVYR           RIDSTATR           RIAGENDR        
##  Min.   :62161   Length:9756        Length:9756        Length:9756       
##  1st Qu.:64600   Class :character   Class :character   Class :character  
##  Median :67039   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :67039                                                           
##  3rd Qu.:69477                                                           
##  Max.   :71916                                                           
##                                                                          
##     RIDAGEYR   
##  Min.   : 0.0  
##  1st Qu.: 9.0  
##  Median :26.0  
##  Mean   :31.4  
##  3rd Qu.:52.0  
##  Max.   :80.0  
## 

2.4 Modifying elements in data frames

Now, let us assume we want to modify/add/remove one or multiple entries/rows/columns in our data frame. The brackets really come in handy now. In this setting, I recommend defining a new data frame before modifying the original one.

Some examples follow.

# Modify one entry:
demo_mod <- demo  # Create a copy to avoid modifying the original data set
demo_mod[1, 1:5]
demo_mod[1, "RIAGENDR"] <- 'Female' # Change gender of the first participant
demo_mod[1, 1:5]
# Modify multiple entries based on a condition:
demo_mod[1:10, 1:5]
demo_mod[!is.na(demo_mod$RIDAGEYR) & demo_mod$RIDAGEYR < 18, ]$RIDAGEYR <- 18  # Set minimum age to 18
demo_mod[1:10, 1:5]

2.4.1 dplyr syntax

With mutate() and case_when() from dplyr, the logic becomes slightly more readable, especially for multiple conditions:

demo_mod <- demo %>%
  mutate(RIDAGEYR = case_when(
    !is.na(RIDAGEYR) & RIDAGEYR < 18 ~ 18,
    TRUE ~ RIDAGEYR  # leave other values unchanged
  ))

💡 case_when() is like a vectorised ifelse() on steroids — it’s handy for multiple conditions and clear logic.

2.5 Dealing with NAs

Handling missing data (NAs) is a common task in data analysis. Before deciding how to treat them, it’s important to understand where and how often they occur.

colSums(is.na(demo))              # Number of NAs per column
##     SEQN SDDSRVYR RIDSTATR RIAGENDR RIDAGEYR RIDAGEMN RIDRETH1 RIDRETH3 
##        0        0        0        0        0     9130        0        0 
## RIDEXMON RIDEXAGY RIDEXAGM DMQMILIZ  DMQADFC DMDBORN4 DMDCITZN DMDYRSUS 
##      418     6338     5747     3749     9205        0        5     7683 
## DMDEDUC3 DMDEDUC2 DMDMARTL RIDEXPRG  SIALANG SIAPROXY SIAINTRP  FIALANG 
##     7157     4196     4196     8548        0        6        0      105 
## FIAPROXY FIAINTRP  MIALANG MIAPROXY MIAINTRP AIALANGA WTINT2YR WTMEC2YR 
##      105      105     3043     3043     3043     4002        0        0 
##  SDMVPSU SDMVSTRA INDHHIN2 INDFMIN2 INDFMPIR DMDHHSIZ DMDFMSIZ DMDHHSZA 
##        0        0       81       51      840        0        0        0 
## DMDHHSZB DMDHHSZE DMDHRGND DMDHRAGE DMDHRBR4 DMDHREDU DMDHRMAR DMDHSEDU 
##        0        0        0        0      365      362      136     4881
sum(complete.cases(demo))         # Number of rows without any NAs
## [1] 0

💡 is.na() returns a logical matrix where TRUE indicates a missing value (NA) and FALSE indicates a non-missing value. colSums() takes this logical matrix and sums up the TRUE values (which are treated as 1), giving you the count of missing values for each column.

💡 complete.cases() returns a logical vector: TRUE if a row has no missing values, and FALSE otherwise. Using sum(complete.cases(...)) counts the number of rows with no missing data.

One way to handle missing data is to remove rows containing NAs for the variable(s) you are interested in. This can be appropriate in some cases, but it should be done with care, as it may introduce bias or reduce sample size. We’ll discuss this further in Chapter 5.

# Remove rows with any missing values in the DMDHRMAR column
demo_DMDHRMAR <- demo[!is.na(demo$DMDHRMAR), ]

# Check for missing values
sum(is.na(demo_DMDHRMAR$DMDHRMAR))
## [1] 0

You can remove all rows with missing values across any of the columns in the dataset using the function na.omit().

# Remove rows with missing values in any column
demo_no_na <- na.omit(demo)

# Check the resulting data frame and its structure
head(demo_no_na)

💡 For our demo data set, this removes all the rows! Another reminder to be very careful when removing NA values.

2.5.1 dplyr syntax

The filter() function makes row-wise filtering more intuitive:

# Keep only rows where DMDHRMAR is not missing
demo_DMDHRMAR <- demo %>%
  filter(!is.na(DMDHRMAR))

# Remove all rows with any missing values
demo_no_na <- demo %>%
  filter(complete.cases(.))

💡 Using filter() keeps the logic clear and readable, especially when chaining multiple conditions with the pipe.

2.6 Exercises

2.6.1 Exercise 2

✏️ Exercise 2: inspect the structure of the demo data set, look at different entries and familiarise yourself with the commands. Here are a few possible tasks, but try doing a few more of your own.

  • View the first few rows of the dataset

  • Display the names of all columns

  • View all unique values in the race column

  • Count how many people are married in the dataset (absolute counts and proportions)

  • Get a quick statistical summary of a few columns (e.g. age, total number of people in the family and age of the household reference person)

  • View the number of missing values in each column

  • Calculate the average age of the cohort

  • Calculate the median total number of people in the household by race

2.6.2 Exercise 3

✏️ Exercise 3: generate a new data frame selecting only the female patients that are above 18 years old and that took the ACASI interview in Spanish.

3 Combining data

In practice, data is often spread across multiple data frames that need to be combined. Depending on the structure and goal, there are different ways to combine data frames:

3.1 Column binding

To add columns side-by-side, the data frames must have the same number of rows.

# Extract one column from demo to create an additional data frame with the same number of rows
extra_info <- demo$RIDRETH1

# Combine using cbind
combined_df <- cbind(demo, extra_info)
combined_df[1, ]
# Add a new column directly to demo with mutate() from dplyr
combined_df <- demo %>%
  mutate(extra_info = RIDRETH1)

3.2 Row binding

To stack data frames vertically, the data frames must have the same column names and types.

# Extract one column from demo to create an additional data frame with the same structure (column names and types)
(new_participant <- demo[18,])
# Combine using rbind
extended_df <- rbind(demo, new_participant)
extended_df[nrow(extended_df),]
# Alternative with dplyr (bind_rows())
extended_df <- bind_rows(demo, new_participant)
extended_df %>%
  slice(n())  # Show last row

💡 n() gives the current group size.

3.3 Merging / Joins

merge() combines data frames based on a common column, similar to SQL joins (see figure below for a reminder on the different types of joins).

# Merge two data frames by participant ID `SEQN` (inner join by default)
merged_demo_bpx <- merge(demo, bpx, by = "SEQN")

# In base R, you can use the Reduce() function to iteratively merge a list of data frames
data_list <- list(demo, bpx, bmx) # list of data frames to merge
merged_demo_bpx_bmx <- Reduce(function(x, y) merge(x, y, by = "SEQN"), data_list) # perform an inner join on all data frames by 'SEQN'

You should specify the type of join you are looking for:

  • Inner join (default): keep only rows with matching SEQN in both data frames: inner_join_df <- merge(demo, bmx, by = "SEQN")
  • Left (outer) join: keep all rows from demo and only matching rows from bmx: left_join_df <- merge(demo, bmx, by = "SEQN", all.x = TRUE)
  • Right (outer) join: keep all rows from bmx and only matching rows from demo: left_join_df <- merge(demo, bmx, by = "SEQN", all.y = TRUE)
  • Full (outer) join: keep all rows from both demo and bmx, filling with NA where there are no matches: full_join_df <- merge(demo, bmx, by = "SEQN", all = TRUE)

💡 merge() in base R is flexible but can be a bit verbose. For simpler syntax, you can also use the dplyr functions left_join(), right_join(), inner_join(), and full_join().

# Inner join
inner_join_df <- demo %>%
  inner_join(bmx, by = "SEQN")

# Left join
left_join_df <- demo %>%
  left_join(bmx, by = "SEQN")

# Right join
right_join_df <- demo %>%
  right_join(bmx, by = "SEQN")

# Full join
full_join_df <- demo %>%
  full_join(bmx, by = "SEQN")

3.4 Exercises

3.4.1 Exercise 4

✏️ Exercise 4: merge the 4 data sets demo, bpx, bmx and smq into a single data set merged_nhanes by performing an inner join on the SEQN ID and using the dplyr package.

4 Saving data

You can save data sets in different formats, depending on how you want to use them later:

  • CSV – human-readable and widely supported (e.g. by Excel, Python, etc.)
  • RDS – R-specific format for saving a single R object (more compact than CSV, preserves data types)
  • RData – used to save multiple R objects into a single file

Let us save a data frame in all those different formats.

# Specify the directory for the data
directory <- "C:/GitHub/rforphysicians/data_sets"

# Save as CSV
write.csv(merged_demo_bpx_bmx, file = file.path(directory,"merged_demo_bpx_bmx.csv"), row.names = FALSE)

# Save as RDS (recommended for re-loading in R)
saveRDS(merged_demo_bpx_bmx, file = file.path(directory,"merged_demo_bpx_bmx.rds"))

# Save as RData (can contain multiple objects)
save(merged_demo_bpx_bmx, file = file.path(directory,"merged_demo_bpx_bmx.RData"))

💡 file.path() constructs the path to a file from components in a platform-independent way. It concatenates paths and filenames, and automatically uses the correct file separator (/ on Unix/macOS, \ on Windows), making your code more robust and portable.

You can load these later using:

# Read CSV
df_csv <- read.csv(file.path(directory, "merged_demo_bpx_bmx.csv"))

# Alternative: use readr for CSV (from tidyverse package)
df_csv2 <- read_csv(file.path(directory, "merged_demo_bpx_bmx.csv"))

# Read RDS
df_rds <- readRDS(file.path(directory, "merged_demo_bpx_bmx.rds"))

# Load RData (loads objects into the environment)
load(file.path(directory, "merged_demo_bpx_bmx.RData"))

💡 When loading RData, check which objects got loaded with ls() (returns a vector of character strings giving the names of the objects in the specified environment).

4.1 Exercises

4.1.1 Exercise 5

✏️ Exercise 5: Save the data frame resulting from Exercise 4 in your working directory in both .csv and .rds formats.

5 Solutions to the exercises

Please note that those are only examples, there are always many ways to solve the same task!

5.1 Exercise 1

☑️ Exercise 1:

Start by downloading the different CSV files on your laptop, ideally in the same folder as your script.

# Load the necessary CSV files into data frames

directory <- "C:/GitHub/rforphysicians/data_sets/" # Your files' directory here

demo <- read.csv(paste0(directory,"DEMO_G.csv"))  # Demographics (cycle G = 2011–2012)
bpx  <- read.csv(paste0(directory,"BPX_G.csv"))   # Blood pressure
bmx  <- read.csv(paste0(directory,"BMX_G.csv"))   # Body measures
smq  <- read.csv(paste0(directory,"SMQ_G.csv"))   # Smoking questionnaire

# Or simply set your working directory to "source file location", then you do not even have to precise the directory
# e.g. demo <- read.csv("DEMO_G.csv")

5.2 Exercise 2

☑️ Exercise 2: Did you manage to select a specific column you were interested in? Were you able to check the number or proportions of entries for a few variables? Be creative :)

Here are solutions for the few example tasks.

# View the first few rows of the dataset
head(demo)
# Display the names of all columns
names(demo)
##  [1] "SEQN"     "SDDSRVYR" "RIDSTATR" "RIAGENDR" "RIDAGEYR" "RIDAGEMN"
##  [7] "RIDRETH1" "RIDRETH3" "RIDEXMON" "RIDEXAGY" "RIDEXAGM" "DMQMILIZ"
## [13] "DMQADFC"  "DMDBORN4" "DMDCITZN" "DMDYRSUS" "DMDEDUC3" "DMDEDUC2"
## [19] "DMDMARTL" "RIDEXPRG" "SIALANG"  "SIAPROXY" "SIAINTRP" "FIALANG" 
## [25] "FIAPROXY" "FIAINTRP" "MIALANG"  "MIAPROXY" "MIAINTRP" "AIALANGA"
## [31] "WTINT2YR" "WTMEC2YR" "SDMVPSU"  "SDMVSTRA" "INDHHIN2" "INDFMIN2"
## [37] "INDFMPIR" "DMDHHSIZ" "DMDFMSIZ" "DMDHHSZA" "DMDHHSZB" "DMDHHSZE"
## [43] "DMDHRGND" "DMDHRAGE" "DMDHRBR4" "DMDHREDU" "DMDHRMAR" "DMDHSEDU"
# View all unique values in the race column
unique(demo$RIDRETH1)
## [1] "Non-Hispanic White"                  "Mexican American"                   
## [3] "Other Race - Including Multi-Racial" "Non-Hispanic Black"                 
## [5] "Other Hispanic"
# Count how many people are married in the dataset
# (absolute counts and proportions)
married <- demo %>%
  mutate(DMDMARTL = ifelse(DMDMARTL == "Married", "married", "other")) %>%
  select(DMDMARTL)
table(married)
## DMDMARTL
## married   other 
##    2683    2877
prop.table(table(married))
## DMDMARTL
##  married    other 
## 0.482554 0.517446
# Get a quick statistical summary of a few columns 
# (here age, total number of people in the family and age of the household reference person)
summary(demo[,c('RIDAGEYR','DMDFMSIZ','DMDHRAGE')])
##     RIDAGEYR       DMDFMSIZ        DMDHRAGE    
##  Min.   : 0.0   Min.   :1.000   Min.   :18.00  
##  1st Qu.: 9.0   1st Qu.:2.000   1st Qu.:33.00  
##  Median :26.0   Median :4.000   Median :43.00  
##  Mean   :31.4   Mean   :3.591   Mean   :45.39  
##  3rd Qu.:52.0   3rd Qu.:5.000   3rd Qu.:56.00  
##  Max.   :80.0   Max.   :7.000   Max.   :80.00
# View the number of missing values in each column
colSums(is.na(demo))
##     SEQN SDDSRVYR RIDSTATR RIAGENDR RIDAGEYR RIDAGEMN RIDRETH1 RIDRETH3 
##        0        0        0        0        0     9130        0        0 
## RIDEXMON RIDEXAGY RIDEXAGM DMQMILIZ  DMQADFC DMDBORN4 DMDCITZN DMDYRSUS 
##      418     6338     5747     3749     9205        0        5     7683 
## DMDEDUC3 DMDEDUC2 DMDMARTL RIDEXPRG  SIALANG SIAPROXY SIAINTRP  FIALANG 
##     7157     4196     4196     8548        0        6        0      105 
## FIAPROXY FIAINTRP  MIALANG MIAPROXY MIAINTRP AIALANGA WTINT2YR WTMEC2YR 
##      105      105     3043     3043     3043     4002        0        0 
##  SDMVPSU SDMVSTRA INDHHIN2 INDFMIN2 INDFMPIR DMDHHSIZ DMDFMSIZ DMDHHSZA 
##        0        0       81       51      840        0        0        0 
## DMDHHSZB DMDHHSZE DMDHRGND DMDHRAGE DMDHRBR4 DMDHREDU DMDHRMAR DMDHSEDU 
##        0        0        0        0      365      362      136     4881
# Calculate the average age
mean(demo$RIDAGEYR, na.rm = TRUE)
## [1] 31.40262
# Calculate the median total number of people in the household by race
demo %>%
  group_by(RIDRETH1) %>%
  summarise(median_people_household = median(DMDHHSIZ, na.rm = TRUE))

5.3 Exercise 3

☑️ Exercise 3:

# Filter the dataset to include only female patients that are above 18 years old and that took the ACASI interview in Spanish.
demo_filtered <- demo[demo$RIAGENDR == "Female" & 
                      demo$RIDAGEYR > 17 & 
                      !is.na(demo$AIALANGA) & demo$AIALANGA == "Spanish", ]
ind <- c('RIAGENDR', 'RIDAGEYR', 'AIALANGA')
demo_filtered[, ind]

💡 Note: When filtering on a variable that contains missing values (e.g., AIALANGA), you must explicitly exclude NAs using !is.na(...). This is because comparisons like demo$AIALANGA == “Spanish” return NA for missing values, not FALSE, so those rows aren’t properly excluded from the subset.

5.4 Exercise 4

☑️ Exercise 4:

# If necessary, install the dplyr package
install.packages("dplyr")
# Load the dplyr package
library(dplyr)

# Merge the datasets using inner joins by SEQN
data_list <- list(demo, bpx, bmx, smq)
merged_nhanes <- Reduce(function(x, y) inner_join(x, y, by = "SEQN"), data_list)

5.5 Exercise 5

☑️ Exercise 5:

# Save the merged dataset in your working directory
# Click on Session > Set Working Directory > To Source File Location, or use file.path()

results_directory <- "C:/GitHub/rforphysicians/data_sets"
write.csv(merged_nhanes, file.path(results_directory, "merged_nhanes.csv"), row.names = FALSE)
saveRDS(merged_nhanes, file.path(results_directory, "merged_nhanes.rds"))

References

Alexander Henzi. 2021. “Programming and Data Analysis with R.” Lecture notes.
Burns, Patrick. n.d. The R Inferno. Accessed May 8, 2025. https://www.burns-stat.com/documents/books/the-r-inferno/.
CDC. 2025. “National Death Index.” Data Linkage. https://www.cdc.gov/nchs/linked-data/mortality-files/index.html.
ChatGPT.” n.d. Accessed January 26, 2025. https://chatgpt.com.
Christopher J. Endres. 2025. “Introducing nhanesA.” https://cran.r-project.org/web/packages/nhanesA/vignettes/Introducing_nhanesA.html.
“Create Elegant Data Visualisations Using the Grammar of Graphics.” n.d. Accessed January 26, 2025. https://ggplot2.tidyverse.org/.
David, Author. 2016. BIRT Joins.” MBSE Chaos. https://mbsechaos.wordpress.com/2016/05/24/birt-joins/.
Elena Kosourova. n.d. RStudio Tutorial for Beginners: A Complete Guide.” Accessed January 26, 2025. https://www.datacamp.com/tutorial/r-studio-tutorial.
Ghasemi, Asghar, and Saleh Zahediasl. 2012. “Normality Tests for Statistical Analysis: A Guide for Non-Statisticians.” International Journal of Endocrinology and Metabolism 10 (2): 486–89. https://doi.org/10.5812/ijem.3505.
Grolemund, Hadley Wickham and Garrett. n.d. R for Data Science. Accessed May 8, 2025. https://r4ds.had.co.nz/introduction.html.
Mayer, Michael. 2025. “Mayer79/Statistical_computing_material.” https://github.com/mayer79/statistical_computing_material.
Patrick Burns. n.d. Impatient R. Accessed May 8, 2025. https://www.burns-stat.com/documents/tutorials/impatient-r/.
P-Value.” 2025. Wikipedia. https://en.wikipedia.org/w/index.php?title=P-value&oldid=1305292611.
“Synthetic Dataset for AI in Healthcare.” n.d. Accessed May 9, 2025. https://www.kaggle.com/datasets/smmmmmmmmmmmm/synthetic-dataset-for-ai-in-healthcare.
“The Comprehensive R Archive Network.” n.d. Accessed January 26, 2025. https://stat.ethz.ch/CRAN/.
W. N. Venables, D. M. Smith and the R Core Team. n.d. “An Introduction to R.” Accessed May 8, 2025. https://cran.r-project.org/doc/manuals/r-release/R-intro.html.
Wickham, Hadley. n.d. Advanced R. Accessed May 8, 2025. https://adv-r.hadley.nz/introduction.html.